release-23.1: storage: fatal on corruption encountered in background #102274
Previously, on-disk corruption would only fatal the node if an iterator observed it. Corruption encountered by a background job like a compaction would not fatal the node. This could leave the node busily churning through compactions that repeatedly fail, impacting cluster stability and user query latencies.
Now, on-disk corruption results in the node exiting immediately.
Epic: none
Fixes: #101101
Release note (ops change): When local corruption of data is encountered by a background job, a node will now exit immediately.
6cdac3c to b2c07b6
Thanks for opening a backport. Please check the backport criteria before merging:
If some of the basic criteria cannot be satisfied, ensure that the exceptional criteria are satisfied within.
Add a brief release justification to the body of your PR to justify this backport. Some other things to consider:
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
LGTM
TFTR!
Backport 1/1 commits from #102252 on behalf of @jbowens.
/cc @cockroachdb/release
Previously, on-disk corruption would only fatal the node if an iterator observed it. Corruption encountered by a background job like a compaction would not fatal the node. This could leave the node busily churning through compactions that repeatedly fail, impacting cluster stability and user query latencies.
Now, on-disk corruption results in the node exiting immediately.
Epic: none
Fixes: #101101
Release note (ops change): When local corruption of data is encountered by a background job, a node will now exit immediately.
Release justification: Very low-risk change to resolve an issue that can affect cluster stability.